Authors:
Tyler Young - 33565667
This is a classification task that consists of correctly classifying a variable 'Y' with respect to 23 other variables 'X1-X23'. Classification tasks can be defined as the task of learning a target function F that maps each attribute set X to one of the predefined class labels Y. Y denotes a credit default of either 1 or 0, in other words yes or no, making it a classification task. The objective is to predict this variable using the 23 attributes. Since this is a paired assignment, 6 data mining algorithms will be employed on the relevant credit dataset provided, which has already been split into a train set and test set. The best model out of the 6 will be selected to finally evaluate on the test set, which is unseen data and provides a realistic benchmark of the model's performance on new data.
To achieve this outcome, we'll be following the universal workflow of machine learning: inspecting and preparing the data, building and tuning models, and finally evaluating on held-out data.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
# Common imports
import os
import numpy as np
import pandas as pd
# Plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Scikit-Learn modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error
# Load dataframes from CSV files
train_df = pd.read_csv('creditdefault_train.csv')
test_df = pd.read_csv('creditdefault_test.csv')
Like any data mining task, the first step is to inspect the data to gain a greater understanding of what we are dealing with before preprocessing it. We start off by using the head() function on both the train and test set, which shows a dataframe of the first 5 rows. As expected, Y values are either 1 for yes or 0 for no, whereas the other 23 variables are attributes used to predict the outcome.
train_df.head()
test_df.head()
Training Dataframe Information
train_df.info()
By inspecting the data, we can see that all variables are of type int, meaning we won't need to perform one-hot encoding, the technique used to transform categorical or object types into a format suitable for the algorithms.
Training Dataframe Described
train_df.describe()
This function is useful since it provides a summary of the statistical properties or characteristics of the dataset, which is particularly easy to read. It serves as a tool to identify whether there are any missing values in the dataset or any outliers that must be dealt with before continuing with the task.
train_df.isnull().sum().sum() #Checking if the training set dataframe has any empty values (NaN)
test_df.isnull().sum().sum() #Checking if the test set dataframe has any empty values (NaN)
This function is used to check for any missing values in both the training and test set, which would have been problematic for model building had there been any. Typically, we could implement a simple imputer to fill any missing values with the mean or median, so that the classification models have more data to work with, since discarding rows with missing values would leave the models less data to learn from.
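As a small sketch of that fallback (not needed here, since the credit data has no missing values), a SimpleImputer could fill NaNs with the column median. The tiny frame below is hypothetical, standing in for the real data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one missing value, standing in for the credit data
df = pd.DataFrame({"X1": [20000.0, np.nan, 90000.0], "X2": [2.0, 1.0, 2.0]})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(filled.isnull().sum().sum())  # 0 -> no missing values remain
print(filled.loc[1, "X1"])          # 55000.0, the median of X1
```

The same imputer, fitted on the training set only, could then transform the test set to avoid leaking test statistics into training.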
In addition to data inspection, we can always plot the data to visualise and gain insights. Data visualisation is critical for recognising any underlying patterns and extracting useful insights, which is especially true when dealing with complex, multivariate data. To start off, we will build a correlation heatmap to represent the correlations between different variables.
sns.set(rc={'figure.figsize':(20,20)})
train_df_corr = train_df.corr()
sns.heatmap(train_df_corr, annot=True, vmin=-1, vmax=1, center=0, cmap='coolwarm', square=True)
# Plot the correlation of Y with other variables
corr_with_Y = train_df_corr.loc['Y']
corr_with_Y = corr_with_Y.drop('Y') # Drop correlation with itself
corr_with_Y = corr_with_Y.sort_values(ascending=False)
plt.figure(figsize=(15,5))
sns.barplot(x=corr_with_Y.index, y=corr_with_Y)
plt.xticks(rotation=90)
plt.title('Correlation of Y with other variables')
plt.show()
The correlation heatmap is ideal for showing patterns and relationships between variables in the data. We can see how each variable is correlated with Y. To reinforce this, a barplot beneath the heatmap shows the same correlations in a different format. These two visualisation methods provide a useful tool for insight, especially when dealing with a dataset as large as this one. In particular, the heatmap shows a cluster of similar values in the middle, specifically at X12-X17, which are strongly correlated with one another.
trainDF_long = pd.melt(train_df, id_vars=["Y"], var_name="variable", value_name="value")
g = sns.FacetGrid(trainDF_long, col="Y", aspect=4)
g.map(sns.scatterplot, "variable", "value")
g.fig.subplots_adjust(top=3)
FacetGrids are another data visualisation tool for mapping multiple plots grouped by a specified variable. The FacetGrid above suggests that when Y=1 in the training dataframe, the data contains fewer extreme values. Conversely, when Y=0 there are major outliers in variables such as X19, so unusually high values there will most probably coincide with Y=0.
sns.pairplot(train_df.sample(100), hue="Y")
This is yet another useful visualisation tool for exploring relationships and patterns in the data. Since the data is multivariate, that is, involving many variables that together produce a single outcome, the pairplot enables multivariate analysis, letting us identify correlations in the data, similar to the correlation heatmap. Here it takes a random sample of 100 rows and plots each pair of variables against each other in a grid of scatterplots, coloured by Y.
The next step in the task is to prepare the data accordingly, so that it's suitable for each model to use. The data first needs to be split into train and test sets, which has already been done, making the process easier. We also need to split the features and target variables into separate sets, which will be called X_train/test and y_train/test. This is needed so that the models can make predictions and be compared with the actual values (y_train/test).
Models will be built using the train set and then finally we will employ the test set on the most suitable model.
# Split data into X and y (features and target)
X_train = train_df.drop('Y', axis=1)
y_train = train_df['Y']
X_test = test_df.drop('Y', axis=1)
y_test = test_df['Y']
# Hold out a validation split from the training data (the provided
# test set from creditdefault_test.csv stays untouched for the final evaluation)
val_size = 0.2 # proportion of the training data to hold out
random_state = 42 # set a random seed for reproducibility
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=val_size, random_state=random_state)
The data has been prepared sufficiently, so the next step is to build the models using the training set. Each model will start off with some predefined parameter values, followed by an alternate version of that same model with some parameters changed to yield a different result. Each model will also use cross validation, a common method in machine learning that works by randomly splitting the training data into subsets called 'folds'. In this case, we will use 5 folds of cross validation throughout the model building process, which can be observed via the 'cv' parameter.
Since this is a classification task, we'll use accuracy as the main evaluation metric to judge how well each model has performed, and since we're using cross validation, we take the accuracy over the 5 folds and report their mean.
The penultimate step in the model building process will consist of fine-tuning in the form of GridSearch. GridSearch makes model building much more streamlined by selecting the best combination of hyperparameters for any given model. Instead of manually tuning hyperparameters to yield the best results, which is extremely tedious, this technique automates the process, allowing for greater efficiency.
Finally, we will plot each model on a graph to show its accuracy with respect to some model-specific hyperparameter.
Training results can vary widely depending on whether the model has underfit or overfit the training data. This can happen for a number of reasons, but in most data mining tasks the main one is noise in the data.
We will start off by building one of the most common algorithms in machine learning and data mining: the k-Nearest Neighbours algorithm. The algorithm is part of a family of techniques called instance-based learning, which classify new instances of data by comparing them with pre-existing instances seen in training. The nearest neighbour algorithm operates by comparing a given test instance z to the k points closest to z, where k is the number of neighbours specified in the algorithm; for instance, with k = 5 it identifies the 5 closest neighbours to z and then makes a classification based on a majority vote among their classes.
Another important factor of the nearest neighbour algorithm is the distance. When designing this algorithm, the first step is to define the distance measure. Common distance measures include the Euclidean distance, Manhattan distance and Minkowski distance. The algorithm implemented by scikit-learn has a default power p of 2, which is the Euclidean distance. We will use this distance throughout training for the purpose of consistency.
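To make these distance measures concrete, here is a small sketch in plain NumPy (not scikit-learn's internals) of the Minkowski distance, which reduces to Manhattan at p=1 and Euclidean at p=2:

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```

This is the same p that KNeighborsClassifier exposes as its `p` parameter, so leaving it at the default of 2 keeps all our k-NN runs on Euclidean distance.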
An important note is that this classifier can overfit when k is too small on noisy training data, while if k is too large it can misclassify the test instance by incorporating data points located far from its neighbourhood.
For this reason, we start out with the default parameter of n_neighbors = 5, which can be changed after the initial test with manual tuning and fine-tuning of hyperparameters using GridSearch.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
scores=cross_val_score(knn, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
We use the get_params() function to gain insight into the KNN estimator and see which parameters are available for grid search and model tuning.
params = knn.get_params() #Fetching parameters for KNN so that we know what to utilise for grid search and model tuning
print(params)
knn.n_neighbors = 1 #Trying with 1 neighbour instead of 5
scores=cross_val_score(knn, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
knn.n_neighbors = 5
knn.leaf_size = 1000 # Experimenting with a larger leaf size
scores=cross_val_score(knn, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
As expected, the model performs considerably worse when the number of neighbours (k) is too small, since it overfits on noisy training data. The default n_neighbors of 5 produced the best results using cross validation. The next step will be to find the optimal hyperparameters using GridSearch.
The 3rd test experiments with a larger leaf_size of 1000, compared to the default value of 30. This test had the same outcome as the first test, which used the same number of neighbours. This is unsurprising, since leaf_size only affects the speed and memory of the neighbour search, not the predictions themselves. For this reason, we won't be using it as a parameter in the fine-tuning process.
The next step is GridSearch to automate the hyperparameter tuning process.
param_grid = [
{'n_neighbors': [1, 3, 5, 7, 9]} ] #Specifying hyperparameters using a param_grid
grid_search = GridSearchCV(knn, param_grid, cv=5, #SKLearn function for GridSearch, specifying the model and using the same parameters of CV and scoring
scoring='accuracy',
return_train_score=True,
n_jobs=-1) #Runs in parallel to number of jobs
grid_search.fit(X_train, y_train) #Fit grid search to features and labels
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
# display performance result for each set of hyperparameters specified
cvres = grid_search.cv_results_ #Dictionary containing performance metrics
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
As with the earlier data visualisations, plotting the results on a graph is far more digestible than reading the long list of performance metrics printed by the code above. Such plots are also more useful for observing whether a model has overfit or underfit on the training data.
x = [1, 3, 5, 7, 9]
mean_scores = cvres["mean_test_score"]
std_scores = cvres["std_test_score"]
sns.set_style("darkgrid")
fig, ax = plt.subplots(figsize=(8, 6))
sns.lineplot(x=x, y=mean_scores, marker='o', ax=ax)
ax.fill_between(x, mean_scores - std_scores, mean_scores + std_scores, alpha=0.3)
ax.set_xlabel('Number of Neighbours')
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy vs. Number of Neighbours')
plt.show()
Reflecting the tests above, the accuracy is significantly worse when k is too small. Accuracy increases as k increases, but after a few iterations it starts to plateau, suggesting that adding more neighbours would eventually result in underfitting. In short, this model is far too simple: it will miss important patterns in the data and won't generalise well on unseen data.
The next model we will be building is a decision tree classifier. This algorithm is a nonparametric approach in the context of classification models, meaning it usually doesn't require prior assumptions about the probability distributions in a given dataset. Decision trees consist of three types of nodes: a root node, internal nodes, and leaf nodes.
Leaf nodes are assigned class labels, while root and internal nodes contain attribute test conditions that separate records according to their characteristics. The algorithm starts at the root node and, depending on the result of the test condition, follows the corresponding branch. This leads either to an internal node, where the step is repeated, or to a leaf node, which has no further nodes (children). These steps are applied recursively until a stopping criterion is met, such as the tree reaching its maximum depth, and the class label associated with the relevant leaf node is assigned to the record.
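The traversal described above can be sketched with a minimal, hypothetical node structure (a hand-built stump, not scikit-learn's internal representation): follow test conditions from the root until a leaf, then return that leaf's class label.

```python
# Minimal sketch of how a trained decision tree classifies a record:
# follow attribute test conditions from the root until a leaf is reached.
def classify(node, record):
    if node["label"] is not None:      # leaf node: return its class label
        return node["label"]
    branch = node["test"](record)      # attribute test condition (True/False)
    return classify(node["children"][branch], record)

# Tiny hand-built tree: "is X1 > 5?" -> yes: class 1, no: class 0
tree = {
    "label": None,
    "test": lambda r: r["X1"] > 5,
    "children": {
        True:  {"label": 1, "test": None, "children": None},
        False: {"label": 0, "test": None, "children": None},
    },
}
print(classify(tree, {"X1": 7}))  # 1
print(classify(tree, {"X1": 3}))  # 0
```

Each classification touches at most one node per level, which is where the O(w) test-time complexity mentioned below comes from.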
This model is far more complex relative to KNN, meaning it is less likely to underfit on the training data. The model is also computationally inexpensive to construct and is less susceptible to overfitting due to the presence of noise in data. The presence of redundant attributes also will not adversely affect the accuracy of decision trees, which matters here since there are so many similar attribute variables. Furthermore, once the model has been built, classifying unseen data is fast, with a worst-case time complexity of O(w), where w is the maximum depth of the tree.
The initial model has a default max_depth of None, but we will implement an alternate version of the same model with this value changed afterwards.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
dt = tree.DecisionTreeClassifier()
scores=cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
dt.max_depth=3 #Testing using a max_depth of 3
scores=cross_val_score(dt, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
The default max_depth of None produces a worse outcome relative to the second test using a max_depth of 3. This is because an unrestricted tree grows until its leaves are pure, fitting noise in the training data, whereas capping the depth at 3 regularises the model. One downside to increasing max_depth is that it can lead to overfitting; however, methods such as post-pruning, GridSearch and ensemble methods, which we will see later in this report, can mitigate this.
param_grid = [
{'max_depth': [2, 3, 4, 5, 6, 7, 8]} ]
grid_search = GridSearchCV(dt, param_grid, cv=5,
scoring='accuracy',
return_train_score=True,
n_jobs=-1)
grid_search.fit(X_train, y_train)
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
results_df = pd.DataFrame(cvres)
sns.set_style("whitegrid")
sns.lineplot(x='param_max_depth', y='mean_test_score', data=results_df)
plt.title('Accuracy by Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.show()
We can see that as max_depth increases beyond the optimal value (4), the tree starts to overfit and cross-validated accuracy decreases with each increase in depth. Beyond this point the model becomes too complex and fits noise in the training data, so it will not generalise well on unseen data.
The next series of algorithms are ensemble-based algorithms, which elaborate on the aforementioned single classification algorithms. These models work by aggregating the predictions of multiple classifiers to improve classification accuracy, and tend to perform better than any single classification method such as nearest neighbours or decision trees. This methodology can be compared to that of the nearest neighbour algorithm, whereby classifications are justified on the basis of a majority rule. Instead of making a classification based on one model, we take an ensemble of individual models, each with a considerably higher error rate, and then reduce that error rate by taking a majority vote on the predictions made by those models.
Bagging is a technique that improves the generalization performance of models by reducing their variance. It does so by repeatedly sampling from a dataset with replacement to create multiple bootstrap samples of the same size as the original data. Each of these bootstrap samples is then used to train a separate base classifier, such as a decision stump.
After training k classifiers (the number of base classifiers), the final prediction for a test instance is obtained by combining the predictions of all the base classifiers. This is typically done by assigning the test instance to the class chosen by a majority vote of the base classifiers. For example, if two classes of pluses and circles are split by a decision boundary on a 2D plane, the bagged model will use the votes from multiple base classifiers to classify new data points.
Bagging improves the generalization error of base classifiers by reducing their variance. This is because the multiple bootstrap samples ensure that the base classifiers are trained on slightly different subsets of the original data, which reduces the impact of outliers and noise. The performance of bagging depends on the stability of the base classifiers: it is most effective for unstable classifiers, whose predictions are sensitive to small changes in the training data, since averaging over bootstrap samples then yields the largest reduction in variance.
Overall, bagging is an effective technique for improving the generalization performance of models, especially when applied to noisy data or unstable classifiers. By combining multiple base classifiers trained on slightly different subsets of the data, bagging can create a more accurate and robust model that is better able to handle complex data.
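The two ingredients described above, bootstrap sampling with replacement and a majority vote, can be sketched with plain NumPy (a toy illustration, not the BaggingClassifier internals):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)

# A bootstrap sample: drawn with replacement, same size as the original data,
# so some records appear multiple times and others not at all.
sample = rng.choice(data, size=len(data), replace=True)
print(sample)

# Majority vote over the predictions of 5 base classifiers for one test instance
preds = np.array([1, 0, 1, 1, 0])
majority = np.bincount(preds).argmax()
print(majority)  # 1 (three of the five classifiers voted 1)
```

BaggingClassifier repeats the sampling step once per estimator, trains a decision tree on each sample, and applies the vote per test instance.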
The first model utilises the default value of 10 estimators, which is then multiplied by 10 in the second test to see if there is a considerable difference.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bg=BaggingClassifier(DecisionTreeClassifier()) #Default = 10 estimators (trees)
scores=cross_val_score(bg, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
bg.n_estimators=100 # by default 10 estimators (trees) are used
scores=cross_val_score(bg, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
It's evident that an increase in estimators improves the model's accuracy when comparing the two tests, which used the default of 10 estimators followed by 100. The results show the mean accuracy rising from 80.2% to 81.5%, suggesting that increasing this parameter directly improves performance. An important note, however, which can also be observed in previous models in this report, is that more estimators does not always mean better performance: increasing the number of estimators can reach a point of diminishing returns, beyond which further increases no longer lead to any significant improvement.
The next step will be to perform GridSearch using two hyperparameters: n_estimators and max_features.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': np.arange(100,1000,100), #Using the np.arange function to generate a numpy array with values starting at 100, scaling up to 900 with 100 step intervals.
'max_features': [2, 3, 4, 6, 8, 10, 12, 14]
}
grid_search = GridSearchCV(bg, param_grid, cv=5,
scoring='accuracy',
return_train_score=True,
n_jobs=-1)
grid_search.fit(X_train, y_train)
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
grid_search.best_params_ #best parameters for the model to utilise
max_features_list = [2, 3, 4, 6]
mean_scores = []
for mf in max_features_list:
    # Average the mean test score over every n_estimators setting that used
    # this max_features value (each params dict holds both hyperparameters,
    # so a direct .index({'max_features': mf}) lookup would fail)
    scores_mf = [score for score, p in zip(cvres['mean_test_score'], cvres['params'])
                 if p['max_features'] == mf]
    mean_scores.append(np.mean(scores_mf))
plt.xlabel('Max Features')
plt.ylabel('Accuracy')
plt.title('Bagging Classifier Grid Search')
sns.lineplot(x=max_features_list, y=mean_scores)
A quick note is that we attempted to plot n_estimators and max_features together against accuracy, but it would repeatedly crash the program despite troubleshooting and trying various remedies such as reducing the range of hyperparameters. We had to resort to including only max_features in the data plot, which still provides some useful insight into the model's performance. The plot was also only able to scale up to 6 max_features, since we encountered issues beyond this amount. However, we can still see the model's relative performance by observing the GridSearch output and the best_params_ attribute: {'max_features': 14, 'n_estimators': 800}.
The model steadily increases in accuracy, but the rate of increase slows after max_features = 3. In the GridSearch output, accuracy continues to rise until around 10-12 max features. This tells us that increasing max_features also increases the predictive power of the model; however, if the parameter is increased too far, overfitting becomes highly likely.
The next algorithm is an implementation of the boosting method called AdaBoost (adaptive boosting), yet another ensemble method used to tackle classification tasks. It is similar to the previous method, but instead of using a simple majority voting scheme, the base classifiers C are trained sequentially on weighted samples of the data, and the sample weights are adjusted after each round to improve the performance of the next classifier via a parameter α. This parameter is large and positive when a classifier's error rate is close to 0, and a large negative value when it's closer to 1.
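For reference, the standard AdaBoost weight-update rule, with ε_j the weighted error rate of classifier C_j and Z_j a normalisation factor, takes the form:

```latex
% Boosting round j: correctly classified records have their weights
% decreased, misclassified ones increased, then weights are renormalised.
w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
\begin{cases}
  e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\
  e^{\alpha_j}  & \text{if } C_j(x_i) \neq y_i
\end{cases}
\qquad \text{where } \alpha_j = \tfrac{1}{2}\ln\frac{1 - \epsilon_j}{\epsilon_j}
```

Note that α_j is large and positive when ε_j is near 0 and large and negative when ε_j is near 1, matching the description above.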
from sklearn.ensemble import AdaBoostClassifier
ab=AdaBoostClassifier()
scores=cross_val_score(ab, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
ab.n_estimators=200
ab.learning_rate=0.5
scores=cross_val_score(ab, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
This model has produced the highest accuracy thus far of all the models trained on the data, with a high of 82% in the initial test. The second test used 200 estimators and a learning rate of 0.5, which together noticeably diminished the accuracy. However, the fine-tuning below reveals that the number of estimators is not in fact the main factor lowering the accuracy; the learning rate is.
param_grid = {
'n_estimators': np.arange(10,200,10),
'learning_rate': [0.01, 0.05, 0.1, 0.5, 1]
}
grid_search = GridSearchCV(ab, param_grid, scoring='accuracy', cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
grid_search.best_params_
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
By observing the output of GridSearch, using np.arange scaling up to 200 estimators and a learning rate ranging from 0.01 to 1 in the param_grid, it's evident that the learning rate has a drastic impact on the results. This can be understood through the fundamentals of gradient descent and how the learning rate governs whether a global minimum is reached.
The learning rate affects whether the optimiser reaches a local or global minimum. A lower learning rate takes longer to run but approaches a minimum more precisely, whereas a higher learning rate takes steps that are too large and can overshoot the minimum entirely. This is reflected in the output above, which shows that accuracy decreases as the learning rate increases.
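AdaBoost's learning_rate actually shrinks each classifier's contribution rather than performing gradient descent, but the overshooting intuition above can be sketched on a toy function:

```python
# Toy 1-D illustration: gradient descent on f(x) = x^2 with two learning
# rates. A small rate converges toward the minimum at 0; an overly large
# rate overshoots the minimum each step and diverges.
def descend(lr, steps=20, x=5.0):
    for _ in range(steps):
        x = x - lr * 2 * x   # gradient of x^2 is 2x
    return x

print(abs(descend(0.1)))   # near 0: converges
print(abs(descend(1.1)))   # very large: diverges
```

With lr = 0.1 each step multiplies x by 0.8, while with lr = 1.1 it multiplies x by -1.2, so the iterate oscillates with growing magnitude.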
sns.set_style("whitegrid")
plt.figure(figsize=(10,5))
sns.lineplot(data=cvres, x="param_n_estimators", y="mean_test_score", hue="param_learning_rate", marker='o')
plt.title("AdaBoost Classifier Grid Search Results")
plt.xlabel("Number of Estimators")
plt.ylabel("Accuracy")
plt.show()
This data visualisation reinforces the idea that a lower learning rate will yield a much higher accuracy compared to a higher learning rate. This is especially apparent across the number of estimators, with the higher learning rates decreasing in accuracy as n_estimators increases. However, the fact that the lower learning rates remain stagnant as n_estimators increases suggests that the model is underfitting on the training data.
The final ensemble model in this classification task will be the random forest classifier. It works by combining the predictions made by multiple decision trees as an ensemble, each generated from the values of an independent set of random vectors, hence the name random forest. Unlike AdaBoost, a fixed probability distribution is used to generate these random vectors. The trees are fully grown without any pruning, in order to avoid bias when constructing each subsequent decision tree. Once all trees have been constructed, the predictions are determined by a majority vote.
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
scores=cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
rf.max_features= 2
rf.n_estimators=500
scores=cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
The initial model utilising default parameters achieved a mean accuracy of 81.6%, whereas the 2nd model in which max_features and n_estimators have been set to custom values achieved a lower accuracy of 81.4%, a small decrease in performance. GridSearch will be a necessity to find the optimal parameters.
param_grid = [
{'max_features': [2, 3, 4, 6, 8, 10]} ]
grid_search = GridSearchCV(rf, param_grid, cv=5,
scoring='accuracy',
return_train_score=True, n_jobs=-1)
grid_search.fit(X_train, y_train)
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
grid_search.best_params_
max_features_range = [2, 3, 4, 6, 8, 10]
accuracies = []
for f in max_features_range:
    rf = RandomForestClassifier(max_features=f)
    scores = cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=5)
    accuracies.append(np.mean(scores))
# Plot the accuracies vs max_features
plt.figure(figsize=(8,6))
sns.set_style('whitegrid')
axis = sns.lineplot(x=max_features_range, y=accuracies, linewidth=2)
axis.set_xlabel('Max Features')
axis.set_ylabel('Accuracy')
axis.set_title('Random Forest Classifier Accuracy vs Max Features')
plt.show()
Although it differs slightly from the GridSearch output, the graph shows that the model peaks at max_features = 6, with an accuracy of 81.77%. Beyond this point the model appears to be overfitting on the training data, which contrasts with the other models, which usually underfit. Overfitting on the training data will lead to poor generalisation on unseen data, so it will be more suitable to pick a stronger model.
The 6th and final algorithm will be the Support Vector Machine (SVM), which is not an ensemble based algorithm like that of the three previous algorithms implemented. Before explaining how the model works, it's first important to understand the notion of a hyperplane in this context - A plane of dimension one less than the dimension of data space which divides the classes of data. The SVM works by finding this hyperplane to separate the two classes in such a way that it maximises the margin between them. The hyperplane serves as a decision boundary which separates data classes, and the margin is the distance between the hyperplane and the nearest data points of each class.
The SVM algorithm works by trying to maximise this margin, since decision boundaries with large margins tend to have better generalisation errors. Conversely, models with small margins are more susceptible to overfitting.
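In the standard linear formulation, the hyperplane is w·x + b = 0 and the margin being maximised is 2/‖w‖, so the hard-margin training problem can be written as:

```latex
% Maximising the margin 2/||w|| is equivalent to minimising ||w||^2 / 2
% subject to every training point lying on the correct side of the margin.
\min_{w, b} \ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w \cdot x_i + b \right) \ge 1, \quad i = 1, \dots, N
```

The soft-margin version used by SVC adds a penalty term C·Σξ_i for margin violations, which is the regularisation parameter C explored in the grid search below: a larger C penalises violations more heavily and so permits a smaller margin.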
The algorithm excels at finding a global minimum of its cost function, since the underlying optimisation problem is convex; this differs from other classification methods that operate on a greedy approach and tend to find only a local minimum.
# Create an SVM model
from sklearn import svm
from sklearn.svm import SVC
sVM = SVC()
scores=cross_val_score(sVM, X_train, y_train, scoring='accuracy', cv=5)
print('accuracy for 5 folds: ', scores)
print('mean: ', np.mean(scores))
params = sVM.get_params() #Fetching parameters for SVM
print(params)
param_grid = [
{'C': [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]} ]
grid_search = GridSearchCV(sVM, param_grid, cv=5,
scoring='accuracy',
return_train_score=True,
n_jobs=-1)
grid_search.fit(X_train, y_train)
print('best parameter values', grid_search.best_params_)
print('best estimator', grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(mean_score, params)
# Repeat the grid search with an explicit RBF kernel
param_grid = [{'C': [0.01, 0.1, 0.2, 0.5, 0.8, 1, 5, 10, 20, 50]}]
svm_clf = SVC(kernel='rbf', gamma='auto')
grid_search = GridSearchCV(svm_clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
mean_scores = grid_search.cv_results_['mean_test_score']
C_values = [x['C'] for x in grid_search.cv_results_['params']]
# plot the mean test scores against C values
sns.set_style('darkgrid')
plt.figure(figsize=(8, 6))
sns.lineplot(x=C_values, y=mean_scores)
plt.title('Accuracy vs. C')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.show()
Observing the GridSearch output, it is evident that any value of the regularisation parameter C beyond 1 produces considerably worse results. This can also be seen in the graph, which shows a sizable drop in accuracy as C increases. As explained at the outset of the support vector machine section, the objective is to achieve a larger margin, which leads to lower generalisation error on unseen data. Conversely, a larger regularisation parameter C leads to a smaller margin, which in turn produces a worse bias-variance trade-off.
The model overfits as C increases, meaning that the decision boundary becomes too complex and responds to noise in the training data.
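The effect of C on model complexity can be seen directly in the number of support vectors and the training accuracy: a large C penalises margin violations heavily, shrinking the margin and fitting the training data (including its noise) more tightly. A small sketch on noisy synthetic data (illustrative, not the credit dataset):

```python
# Hedged sketch: how C changes the margin. Larger C penalises margin
# violations more, tightening the fit (fewer support vectors, higher
# training accuracy). Synthetic noisy data, not the assignment's dataset.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1,
                           random_state=0)
results = {}
for C in [0.01, 1, 100]:
    clf = SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
    results[C] = (int(clf.n_support_.sum()), clf.score(X, y))
    print(f"C={C:>6}: support vectors={results[C][0]}, "
          f"train accuracy={results[C][1]:.3f}")
```

The rising training accuracy at large C, paired with the falling cross-validation accuracy seen in the grid search above, is the classic overfitting signature.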
All models have been constructed, so the final step in this classification task is to evaluate the best model on the test set, which, being unseen data, provides a more realistic benchmark of the model's performance. Up until now, we have evaluated only the accuracy of each model. Since this is a classification task, it is also important to take other success measures into account, such as precision, recall and the F1 score.
To summarise:
Accuracy - the proportion of correct predictions made by the model.
Precision - the proportion of correct positive predictions out of all instances classified as positive.
Recall - the proportion of actual positives detected by the model.
F1 score - the harmonic mean of precision and recall.
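These four measures can be worked through on made-up counts (not the assignment's actual confusion matrix), assuming TN=50, FP=10, FN=20, TP=20:

```python
# Worked toy example of the four measures, using made-up counts
# (not the assignment's confusion matrix): TN=50, FP=10, FN=20, TP=20.
tn, fp, fn, tp = 50, 10, 20, 20

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct predictions overall
precision = tp / (tp + fp)                    # correct among predicted positives
recall    = tp / (tp + fn)                    # actual positives detected
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1:.2f}")
# → accuracy=0.70, precision=0.67, recall=0.50, F1=0.57
```

Note how a model can look strong on accuracy and precision while recall stays low, which is why all four measures are reported on the test set below.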
Out of all models, AdaBoost achieved the best results once GridSearch had identified its most optimal parameters. This was judged by evaluating each model's accuracy across 5 folds of cross-validation, and then tuning hyperparameters with GridSearchCV to find the most optimal values. In addition to its relative performance on the training data, the algorithm is computationally inexpensive for an ensemble model, and ensembles typically outperform single classifiers, as is evident in this analysis. This means it will be efficient to run on future instances of data. The code below selects the best parameters found for the AdaBoost model and then evaluates it on the test set.
Below, the F1, accuracy, precision and recall scores are computed and printed using the test features and labels. A confusion matrix can also be seen below the test results.
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
# NOTE: `grid_search` must be the AdaBoost GridSearchCV fitted earlier;
# as the notebook stands, the name was last bound to the SVM search, so
# re-run the AdaBoost grid search cell before this one.
best_ab = grid_search.best_estimator_
y_pred = best_ab.predict(X_test)
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
# Evaluation metrics
print("Evaluation metrics on the test set:")
print("F1 score:", f1)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
from sklearn.metrics import confusion_matrix
# Counts of test records correctly and incorrectly predicted, tabulated in a confusion matrix:
cm=confusion_matrix(y_test, y_pred) # confusion matrix
print('confusion matrix: Actual Values on rows, predicted values on columns \n', cm)
The confusion matrix shows that the model was able to classify true negatives and true positives reasonably well. However, the number of false negatives is quite high, which is consistent with the low recall in the test results.
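Row-normalising the confusion matrix makes this reading easier: each diagonal entry becomes the recall of that class. A small sketch with illustrative counts (not the assignment's actual matrix):

```python
# Hedged sketch: row-normalising a confusion matrix puts each class's
# recall on the diagonal. Illustrative counts, not the actual test matrix.
import numpy as np

cm = np.array([[50, 10],    # actual negatives: 50 TN, 10 FP
               [20, 20]])   # actual positives: 20 FN, 20 TP
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm)
# Each row sums to 1; cm_norm[1, 1] is the recall of the positive class.
```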
Before beginning this classification task, we knew it would be a difficult analysis to tackle, since the dataset was so large. Using 23 attribute variables to predict one variable was never going to be easy, but despite this, I am still satisfied with how the process went, from data inspection and visualisation through preprocessing to model building and evaluation. Overall, the final model did not do as well as we expected, achieving a relatively low F1 score and recall, but it still performs reasonably well on unseen data given its high accuracy and precision scores.
The data mining algorithms implemented were complex to understand, and a deeper understanding of the models might have allowed better hyperparameter optimisation and model building.
In hindsight, there were plenty of techniques that I did not employ which could have boosted these scores by a significant margin. For example, more data preprocessing options could have been explored beyond simply splitting the data into feature and label sets, such as categorical encoding or removing variables with low correlation.
Despite not achieving the score we hoped for, I learned a plethora of lessons during this classification task and will be able to apply this knowledge and experience in the future.
Tan, P.-N. et al. (2020) Introduction to Data Mining. New York, NY: Pearson.
Data mining labs/lectures